October 30, 2022

Serious SQL Health Analytics

Photo credit: Danny Ma @ Data with Danny

Introducing Serious SQL by Danny Ma - A comprehensive SQL learning experience.
Link to GitHub repository, including data and final output.

About the data set
Health data measured by users using their wearables and enter into a tracking device. Columns in the data table "user_logs" in the "health" schema are:

  • id: id of the users
  • log_date: the date the record is entered
  • measure: the type of measures, can be weight, blood glucose and blood pressure
  • measure_value: measure values for weight and blood glucose
  • systolic and diastolyc: measure values for blood pressure
  • Photo credit: Danny Ma @ Data with Danny

    Practice questions

    • How many unique users exist in the logs dataset?
      select
      	count (distinct (id)) as unique_id
      	from health.user_logs;
      Answer: 554

    • How many duplicated records are there?
      with cte as (
      	select
      		id,
      		log_date,
      		measure,
      		measure_value,
      		systolic,
      		diastolic,
      		count(*) as count_records
      	from health.user_logs
      	group by
      		id,
      		log_date,
      		measure,
      		measure_value,
      		systolic,
      		diastolic
      	)
      select
      	sum(count_records)
      from cte
      where count_records > 1;
      Answer: 19,449

    • How many unquie records are there?
      with cte as (
      	select
      		id,
      		log_date,
      		measure,
      		measure_value,
      		systolic,
      		diastolic,
      		count(*) as count_records
      	from health.user_logs
      	group by
      		id,
      		log_date,
      		measure,
      		measure_value,
      		systolic,
      		diastolic
      	)
      select
      	sum(count_records)
      from cte
      where count_records = 1;
      Answer: 24,442

    • Who are the users with the most duplicated records?
      with cte as (
      	select
      		id,
      		log_date,
      		measure,
      		measure_value,
      		systolic,
      		diastolic,
      		count(*) as count_records
      	from health.user_logs
      	group by
      		id,
      		log_date,
      		measure,
      		measure_value,
      		systolic,
      		diastolic
      	)
      select
      	id,
      	sum(count_records) as sum_records
      from cte
      where count_records > 1
      group by id
      order by sum_records desc;
      Answer: 054250c692e07a9fa9e62e345231df4b54ff435d with 17279 duplicated records

    • How many duplicated records in each measure category?
      with cte as (
      	select
      		id,
      		log_date,
      		measure,
      		measure_value,
      		systolic,
      		diastolic,
      		count(*) as count_records
      	from health.user_logs
      	group by
      		id,
      		log_date,
      		measure,
      		measure_value,
      		systolic,
      		diastolic
      	)
      select
      	measure,
      	sum(count_records) as sum_records
      from cte
      where count_records > 1
      group by measure
      order by sum_records desc;
      Answer: blood_glucose = 18,471; weight = 677; blood_pressure = 301

    • Ignore duplications, for weight, what are the minimum and maximum records?
      select
      	max(measure_value) as max_rec,
      	min(measure_value) as min_rec
      from health.user_logs
      where measure = 'weight';
      Answer: 39,642,120 and 0

    • Ignore duplications, allocate weights in to 100 percentitles and find outliers.
       select
      	measure_value,
      	ntile(100) over(
      		order by measure_value
      	) as percentile
      from health.user_logs
      where measure = 'weight';
      Answer: ouliers are out side the range of 1.8 to 201

    • Ignore duplications and exclude outliers, what are the avg, median, mode, stde of weight measures?
      select
      	round(
      		avg(measure_value),
      		2
      	) as mean_value,
      	round(
      		cast(
      			percentile_cont(0.5) within group (
      				oder by measure_value
      				) as numeric
      			),
      		2
      	) as median_value,
      	round(
      		mode() within group (
      			order by measure_value
      			),
      		2
      	) as mode_value,
      	round(
      		stddev(measure_value), 
      		2
      	) as standard_deviation
      from health.user_logs
      where 
      	measure = 'weight' and 
      	(measure_value between 1.8 and 201);
      Answer: mean_value/median_value/mode_value/standard_deviation = 80.76/75.98/68.49/26.91

    • How many total measurements do we have per user on average?
      with cte as(
      	select
      		id,
      		count (*) as frequency
      	from health.user_logs
      	group by id
      	)
      select
      	round(
      		sum (frequency)::numeric / count (distinct id),
      		2
      	) as record_per_customer
      from cte;
      Answer: 79.23

    • What about the median number of measurements per user?
      with cte as(
      	select
      		id,
      		count (*) as frequency
      	from health.user_logs
      	group by id
      	)
      select
      	round(percentile_cont(0.5) within group (
      		order by frequency)
      		::numeric,
      	2) as median
      from cte;
      Answer: 2

    • How many users have 3 or more measurements?
      with cte as(
      	select
      		id,
      		count (*) as frequency
      	from health.user_logs
      	group by id
      	)
      select
      	count (id)
      from cte
      where frequency >= 3; 
      Answer: 209

    • How many users have 1,000 or more measurements?
      with cte as(
      	select
      		id,
      		count (*) as frequency
      	from health.user_logs
      	group by id
      	)
      select
      	count (id)
      from cte
      where frequency >= 1000;
      Answer: 5

    • How many users have logged blood glucose measurements?
      with cte as(
      	select
      		id,
      		measure,
      		count (*) as frequency
      	from health.user_logs
      	group by 
      		id, 
      		measure
      	)
      select
      	count (id)
      from cte
      where measure = 'blood_glucose';
      Answer: 325

    • How many user have at least 2 types of measurements?
      with cte as(
      	select
      		id,
      		count (distinct measure) as frequency
      	from health.user_logs
      	group by id
      	)
      select
      	count (id)
      from cte
      where frequency >=2; 
      Answer: 204

    • How many users have all 3 measures - blood glucose, weight and blood pressure?
      with cte as(
      	select
      		id,
      		count (distinct measure) as frequency
      	from health.user_logs
      	group by id
      	)
      select
      	count (id)
      from cte
      where frequency =3;
      Answer: 50

    • For users that have blood pressure measurements, what is the median systolic/diastolic blood pressure values?
      select
      	percentile_cont(0.5) within group (
      		order by systolic)
      		::numeric 
      	as median_systolic,
      	percentile_cont(0.5) within group (
      		order by diastolic)
      		::numeric 
      	as median_diastolic
      from health.user_logs
      where 
      	(measure = 'blood_pressure') and 
      	(systolic is not null) and 
      	(diastolic is not null);
      Answer: 129/79